Finetuning Qwen2.5-3B using DPO with Unsloth

Finetuning Qwen2.5-3B with DPO using Unsloth on the TinyStories preference dataset

Finetuning
DPO
Unsloth
Qwen
Author

Quang T. Duong

Published

August 24, 2024


Introduction

Qwen2.5-3B is a pretrained language model with 3.09 billion parameters. It strikes a balance between expressiveness and computational cost, making it well suited to parameter-efficient fine-tuning. In this guide, we use Direct Preference Optimization (DPO) within the Unsloth framework to fine-tune Qwen2.5-3B, aligning the model's responses with specific preferences and behaviors so that the fine-tuned model handles the target task more reliably.

Fine Tuning with DPO

Direct Preference Optimization (DPO) is a technique for fine-tuning language models when it is critical to optimize outputs according to human preferences, expressed as pairs of chosen and rejected responses to the same prompt. Combining DPO with LoRA (Low-Rank Adaptation) lets us fine-tune efficiently by updating only a small subset of the model's parameters, keeping training costs low while maintaining flexibility in the model's outputs.
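Concretely, DPO trains the policy directly on preference pairs with the following objective (β is the same temperature passed as `beta` to the trainer later in this guide):

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Here $x$ is the prompt, $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_\theta$ is the model being trained, and $\pi_{\text{ref}}$ is a frozen reference model. The loss pushes the model to assign relatively higher likelihood to the chosen response than the rejected one, without a separate reward model.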

Use Case

For this example, our use case is fine-tuning Qwen2.5-3B with DPO to generate engaging Tiny Stories tailored for children. By providing prompts with preferred responses (and contrasting rejected responses), we shape the model to deliver coherent, engaging, and task-appropriate stories across a range of instructions.
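A preference record for this task pairs one prompt with a preferred and a rejected completion. A minimal illustrative sample (the story text below is invented purely for illustration, not taken from the actual dataset):

```python
# One illustrative preference record: DPO compares the "chosen" and
# "rejected" completions written for the same prompt.
sample = {
    "prompt": "Write a story about a kind cat named Tom who shares his lunch.",
    "chosen": (
        "Tom the cat had a big lunch. He saw his friend Sam looked hungry, "
        "so he shared his fish. Sam smiled, and they ate together happily."
    ),
    "rejected": "Tom is a cat. Cats eat lunch. The end.",
}

# DPO training expects exactly these three text columns.
assert set(sample) == {"prompt", "chosen", "rejected"}
assert all(isinstance(v, str) for v in sample.values())
```

The chosen response demonstrates the qualities we want (coherence, warmth, a small arc), while the rejected one is flat and unengaging; the contrast is the training signal.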

Implementation

Step 1: Import Necessary Libraries

from unsloth import PatchDPOTrainer
PatchDPOTrainer()

import os
import torch
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import DPOConfig, DPOTrainer
from google.colab import userdata

Step 2: Initialize Comet ML for Experiment Tracking

import comet_ml
comet_ml.login(project_name="dpo-lora-unsloth")

Step 3: Load Pretrained Model and Tokenizer

max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B",
    max_seq_length=max_seq_length,
    load_in_4bit=False,
    )

Step 4: Apply LoRA Adaptation

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    lora_alpha=32,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    )
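For intuition on why LoRA is cheap: an adapter of rank r on a weight matrix of shape (d_out, d_in) adds only r·(d_in + d_out) trainable parameters instead of d_in·d_out. A quick back-of-the-envelope check with example dimensions (these are illustrative, not Qwen2.5-3B's actual layer shapes):

```python
# Trainable-parameter fraction for a single LoRA-adapted weight matrix.
d_in, d_out, r = 2048, 2048, 32  # example dimensions, not the model's real shapes

full_params = d_in * d_out          # parameters in the frozen base weight
lora_params = r * (d_in + d_out)    # parameters in the rank-r A and B factors

print(lora_params)               # 131072
print(lora_params / full_params) # 0.03125, i.e. ~3% of the original matrix
```

With r=32 as in the configuration above, each adapted projection trains only a few percent of the parameters of the matrix it wraps, which is what keeps memory and compute low.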

Step 5: Dataset Preparation

Format the dataset using a specific template and split it for training and testing.

alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
"""

# Load the preference dataset with 'prompt', 'chosen', and 'rejected' columns.
# The repository ID below is a placeholder -- substitute the dataset you are using.
dataset = load_dataset("<your-tinystories-preference-dataset>", split="train")

EOS_TOKEN = tokenizer.eos_token
def format_samples(example):
    example["prompt"] = alpaca_template.format(example["prompt"])
    example["chosen"] = example['chosen'] + EOS_TOKEN
    example["rejected"] = example['rejected'] + EOS_TOKEN
    return {
        "prompt": example["prompt"],
        "chosen": example["chosen"],
        "rejected": example["rejected"]
    }
dataset = dataset.map(format_samples)
dataset = dataset.train_test_split(test_size=0.05)
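To see what `format_samples` produces without loading the real dataset or tokenizer, here is a standalone sketch with a stand-in EOS token and a toy record (the record contents are invented for illustration):

```python
alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
"""

EOS_TOKEN = "<|endoftext|>"  # stand-in for tokenizer.eos_token

def format_samples(example):
    # Wrap the prompt in the Alpaca template and terminate both
    # completions with the EOS token so generation learns to stop.
    example["prompt"] = alpaca_template.format(example["prompt"])
    example["chosen"] = example["chosen"] + EOS_TOKEN
    example["rejected"] = example["rejected"] + EOS_TOKEN
    return example

record = format_samples({
    "prompt": "Write a story about a brave duck.",
    "chosen": "Dora the duck crossed the big pond all by herself.",
    "rejected": "Ducks swim.",
})

print(record["prompt"].startswith("Below is an instruction"))  # True
print(record["chosen"].endswith(EOS_TOKEN))                    # True
```

`dataset.map` applies exactly this transformation to every record, and `train_test_split(test_size=0.05)` then holds out 5% of the records for evaluation.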

Step 6: Training Using DPOTrainer

Configure and train the model using the DPOTrainer class.

trainer = DPOTrainer(
    model=model,
    ref_model=None,
    tokenizer=tokenizer,
    beta=0.5,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    max_length=max_seq_length//2,
    max_prompt_length=max_seq_length//2,
    args=DPOConfig(
        learning_rate=2e-6,
        lr_scheduler_type="linear",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        optim="adamw_8bit",
        weight_decay=0.01,
        warmup_steps=10,
        output_dir="output",
        eval_strategy="steps",
        eval_steps=0.2,
        logging_steps=1,
        report_to="comet_ml",
        seed=0,
        ),
)


trainer.train()
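One detail worth noting in the configuration above: gradients are accumulated across micro-batches before each optimizer step, so with `per_device_train_batch_size=2` and `gradient_accumulation_steps=8`, each optimizer step effectively sees 16 preference pairs per device. A quick sanity check (assuming a single GPU, as in a typical Colab run):

```python
# Effective batch size = micro-batch size x accumulation steps x device count.
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_devices = 1  # single-GPU assumption

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
print(effective_batch_size)  # 16
```

This is the standard way to reach a larger effective batch on memory-constrained hardware at the cost of more forward/backward passes per step.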

Step 7: Model Inference

Generate a response using the fine-tuned model.

FastLanguageModel.for_inference(model)
message = alpaca_template.format("Write a story about a humble little bunny \
named Ben who follows a mysterious trail in the woods, \
discovering beautiful flowers, new friends, and a lovely pond along the way.")
inputs = tokenizer([message], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048, use_cache=True)

Step 8: Save and Push to Hugging Face Hub

from huggingface_hub import login
# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))

model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("tanquangduong/Qwen2.5-3B-DPO-TinyStories", tokenizer, save_method="merged_16bit")

Inference

Using the fine-tuned model for generating outputs:

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("tanquangduong/Qwen2.5-3B-DPO-TinyStories")
model = AutoModelForCausalLM.from_pretrained("tanquangduong/Qwen2.5-3B-DPO-TinyStories")
alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}"""

model = model.to("cuda")
model.eval()
message = alpaca_template.format("Write a story about a humble little bunny \
named Ben who follows a mysterious trail in the woods, \
discovering beautiful flowers, new friends, and a lovely pond along the way.", "")
inputs = tokenizer([message], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048, use_cache=True)

Conclusion

In this guide, we demonstrated how to fine-tune Qwen2.5-3B using Direct Preference Optimization (DPO) within the Unsloth framework. By leveraging LoRA for parameter-efficient adaptation, we tailored the model’s output behavior to better suit our target use case of generating child-friendly Tiny Stories. This methodology highlights the effectiveness of combining DPO and LoRA to achieve powerful, specialized fine-tuned models.